Data-driven Language Independent Word Segmentation Using Character-Level Information

نویسندگان

  • Dong-Hee Lim
  • Seung-Shik Kang
چکیده

This paper presents a data-driven language independent word segmentation system that has been trained for Chinese corpus at the second Chinese word segmentation bakeoff. The system consists of a base segmentation algorithm and the refining procedures for the undecided character sequences. It does not use any lexicon and the base segmentation is simply done by character bigram and HMM-model is applied for the remaining character sequences. As a final step, high-frequency character trigram modifies the error-prone parts of the text.TT

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging

Word segmentation plays a pivotal role in improving any Arabic NLP application. Therefore, a lot of research has been spent in improving its accuracy. Off-the-shelf tools, however, are: i) complicated to use and ii) domain/dialect dependent. We explore three language-independent alternatives to morphological segmentation using: i) data-driven sub-word units, ii) characters as a unit of learning...

متن کامل

Waterloo at NTCIR-3: Using Self-supervised Word Segmentation

In this paper, we describe the system we use in the NTCIR-3 CLIR (cross language IR) task. We participate the SLIR (single language IR) track. In our system, we use a self-supervised word-segmentation technique for Chinese information retrieval, which combines the advantages of traditional dictionary based approaches with character based approaches, while overcoming many of their shortcomings. ...

متن کامل

Chinese and Japanese Word Segmentation Using Word-Level and Character-Level Information

In this paper, we present a hybrid method for Chinese and Japanese word segmentation. Word-level information is useful for analysis of known words, while character-level information is useful for analysis of unknown words, and the method utilizes both these two types of information in order to effectively handle known and unknown words. Experimental results show that this method achieves high o...

متن کامل

Text segmentation with character-level text embeddings

Learning word representations has recently seen much success in computational linguistics. However, assuming sequences of word tokens as input to linguistic analysis is often unjustified. For many languages word segmentation is a non-trivial task and naturally occurring text is sometimes a mixture of natural language strings and other character data. We propose to learn text representations dir...

متن کامل

Word Segmentation for Urdu OCR System

This paper presents a technique for Word segmentation for the Urdu OCR system. Word segmentation or word tokenization is a preliminary task for understanding the meanings of sentences in Urdu language processing. Several techniques are available for word segmentation in other languages but not much work has been done for word segmentation of Urdu Optical Character Recognition (OCR) System. A me...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005